Iterative cross-training: An algorithm for learning from unlabeled Web pages

Authors

  • Nuanwan Soonthornphisaj
  • Boonserm Kijsirikul
Abstract

The paper presents a learning method, called Iterative Cross-Training (ICT), for classifying Web pages in two classification problems: (1) classification of Thai/non-Thai Web pages, and (2) classification of course/non-course home pages. Given domain knowledge or a small set of labeled data, our method combines two classifiers that effectively use unlabeled examples to iteratively train each other. We compare ICT against three other learning methods: a supervised word segmentation classifier, a supervised naïve Bayes classifier, and a co-training-style classifier. The experimental results on both classification problems show that ICT outperforms the other classifiers. One advantage of ICT is that it needs only a small set of pre-labeled data, or none at all when domain knowledge is available.
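The abstract does not spell out the training loop, so the following is only a minimal sketch of the general idea of two classifiers labeling unlabeled pages for each other. It is not the authors' implementation. It assumes a small labeled seed set, two unlabeled pools, and a simple naive Bayes learner for both classifiers; the names `TinyNaiveBayes`, `iterative_cross_training`, `pool_a`, and `pool_b`, the stopping rule, and the toy pages are all hypothetical.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, docs, labels):
        # Refitting discards any previous state.
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict_one(self, doc):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            # The extra +1 in the denominator reserves mass for unseen words.
            denom = total_words + len(self.vocab) + 1
            score = math.log(self.doc_counts[c] / total_docs)
            for w in doc.lower().split():
                score += math.log((self.word_counts[c][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label

    def predict(self, docs):
        return [self.predict_one(d) for d in docs]


def iterative_cross_training(seed_docs, seed_labels, pool_a, pool_b, max_rounds=10):
    """Assumed loop: each classifier labels the other's unlabeled pool."""
    clf_a = TinyNaiveBayes().fit(seed_docs, seed_labels)  # bootstrap from the seed
    clf_b = TinyNaiveBayes()
    previous = None
    for _ in range(max_rounds):
        labels_b = clf_a.predict(pool_b)                  # A labels B's pool
        clf_b.fit(seed_docs + pool_b, seed_labels + labels_b)
        labels_a = clf_b.predict(pool_a)                  # B labels A's pool
        clf_a.fit(seed_docs + pool_a, seed_labels + labels_a)
        if (labels_a, labels_b) == previous:              # labels stopped changing
            break
        previous = (labels_a, labels_b)
    return clf_a, clf_b


# Toy, made-up pages for the course/non-course problem.
seed_docs = ["course syllabus homework lecture", "family photos hobbies travel"]
seed_labels = ["course", "non-course"]
pool_a = ["lecture notes homework exam", "travel blog photos", "syllabus exam grading"]
pool_b = ["homework assignments lecture schedule", "hobbies music family",
          "exam schedule grading"]

clf_a, clf_b = iterative_cross_training(seed_docs, seed_labels, pool_a, pool_b)
print(clf_a.predict(["lecture homework syllabus"]))
print(clf_b.predict(["family travel photos"]))
```

In this sketch the seed set is kept in every retraining round so that both classes stay represented; whether ICT does the same is not stated in the abstract. Where domain knowledge replaces the seed set, the initial classifier would presumably be built from that knowledge instead.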

Similar resources

Web Page Classification using Iterative Cross-Training Algorithm

The paper presents a generalization of the Iterative Cross-Training (ICT) algorithm, which was previously applied to Thai Web page identification [1]. The main concept of ICT is to iteratively train two sub-classifiers using unlabeled examples in a crossing manner. In this paper, we extend the algorithm to classify Web pages into course or non-course ones, which is a more challenging prob...

Web Page Classification Using Relational Learning Algorithm and Unlabeled Data

This paper investigates applying relational tri-training (R-tri-training for short) to web page classification. R-tri-training, a new relational semi-supervised learning algorithm, is well suited to learning in web page classification. The semi-supervised component of R-tri-training allows it to exploit unlabeled web pages to enhance the learning performance effectively. In addition,...

Coupled Bayesian Sets Algorithm for Semi-supervised Learning and Information Extraction

Our inspiration comes from NELL (Never-Ending Language Learning), a computer program running at Carnegie Mellon University that extracts structured information from unstructured web pages. We consider a semi-supervised learning approach to extracting category instances (e.g. country(USA), city(New York)) from web pages, starting with a handful of labeled training examples of each categor...

Web pages ranking algorithm based on reinforcement learning and user feedback

The main challenge of a search engine is ranking web documents to provide the best response to a user's query. Although a huge number of results are retrieved for a query, users examine only the first few; placing the relevant results in the top ranks is therefore of great importance. In this paper, a ranking algorithm based on the reinforcement le...

The information regularization framework for semi-supervised learning

In recent years, the study of classification has shifted to algorithms for training the classifier from data that may be missing the class label. While traditional supervised classifiers can already cope with some incomplete data, the new type of classifier does not view unlabeled data as an anomaly, and can learn from data sets in which the large majority of training points are unla...

Journal title:
  • Int. J. Intell. Syst.

Volume 19  Issue

Pages  -

Publication date 2004